Dominic Boccaleri Data Mining 1 Decision Tree/Naive Bayes

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import os

from sklearn import tree
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
In [2]:
os.getcwd()
Out[2]:
'C:\\Users\\Dominic\\OneDrive\\DataMining'
In [3]:
data = pd.read_csv("cs-training.csv")
In [4]:
data.head()
Out[4]:
Unnamed: 0 SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0

Convert all variables into numeric bins.

Three-level bins: 1 = Low, 2 = Middle, 3 = High. Two-level bins: 1 = Low, 2 = High. Binary bins: 0 = None, 1 = One.

In [5]:
# Order matters: later assignments overwrite earlier ones,
# so the net effect is <=3 -> 1 (low), 4-10 -> 2 (middle), >10 -> 3 (high).
data.loc[data["NumberOfTime30-59DaysPastDueNotWorse"] <=10, "times_delinquent30-59"] = 2
data.loc[data["NumberOfTime30-59DaysPastDueNotWorse"] <=3, "times_delinquent30-59"] = 1
data.loc[data["NumberOfTime30-59DaysPastDueNotWorse"] >10, "times_delinquent30-59"] = 3
In [6]:
data["times_delinquent30-59"].value_counts()
Out[6]:
1.0    148403
2.0      1324
3.0       273
Name: times_delinquent30-59, dtype: int64
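The chained `.loc` assignments used for binning throughout this notebook can also be written with `pd.cut`, which makes the bin edges explicit in one place. A toy sketch (on made-up counts) matching the <=3 / <=10 / >10 logic above:

```python
import pandas as pd

# Toy counts standing in for a column like NumberOfTime30-59DaysPastDueNotWorse.
s = pd.Series([0, 2, 3, 5, 10, 11, 98])

# Same logic as above: <=3 -> 1 (low), 4-10 -> 2 (middle), >10 -> 3 (high).
binned = pd.cut(s, bins=[-1, 3, 10, float("inf")], labels=[1, 2, 3])
print(binned.tolist())
```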
In [7]:
data.loc[data["NumberOfTime60-89DaysPastDueNotWorse"] <=10, "times_delinquent60-89"] = 2
data.loc[data["NumberOfTime60-89DaysPastDueNotWorse"] <=3, "times_delinquent60-89"] = 1
data.loc[data["NumberOfTime60-89DaysPastDueNotWorse"] >10, "times_delinquent60-89"] = 3
In [8]:
data["times_delinquent60-89"].value_counts()
Out[8]:
1.0    149563
3.0       270
2.0       167
Name: times_delinquent60-89, dtype: int64
In [9]:
data.loc[data["NumberOfTimes90DaysLate"] <=10, "times_delinquent90"] = 2
data.loc[data["NumberOfTimes90DaysLate"] <=3, "times_delinquent90"] = 1
data.loc[data["NumberOfTimes90DaysLate"] >10, "times_delinquent90"] = 3
In [10]:
data["times_delinquent90"].value_counts()
Out[10]:
1.0    149127
2.0       588
3.0       285
Name: times_delinquent90, dtype: int64
In [11]:
# Net effect: <=2 -> 1, 3-4 -> 2, >4 -> 3 (the >4 rule overwrites the first assignment).
data.loc[data["NumberOfDependents"] <=4, "catNumberOfDependents"] = 2
data.loc[data["NumberOfDependents"] <=2, "catNumberOfDependents"] = 1
data.loc[data["NumberOfDependents"] >4, "catNumberOfDependents"] = 3
In [12]:
data["catNumberOfDependents"].value_counts()
Out[12]:
1.0    132740
2.0     12345
3.0       991
Name: catNumberOfDependents, dtype: int64
In [13]:
data.loc[data["NumberRealEstateLoansOrLines"] <=10, "catNumberRealEstateLoansOrLines"] = 2
data.loc[data["NumberRealEstateLoansOrLines"] <=3, "catNumberRealEstateLoansOrLines"] = 1
data.loc[data["NumberRealEstateLoansOrLines"] >10, "catNumberRealEstateLoansOrLines"] = 3
In [14]:
data["catNumberRealEstateLoansOrLines"].value_counts()
Out[14]:
1.0    146348
2.0      3558
3.0        94
Name: catNumberRealEstateLoansOrLines, dtype: int64
In [15]:
data.loc[data["NumberOfOpenCreditLinesAndLoans"] <=10, "catNumberOfOpenCreditLinesAndLoans"] = 2
data.loc[data["NumberOfOpenCreditLinesAndLoans"] <=3, "catNumberOfOpenCreditLinesAndLoans"] = 1
data.loc[data["NumberOfOpenCreditLinesAndLoans"] >10, "catNumberOfOpenCreditLinesAndLoans"] = 3
In [16]:
data["catNumberOfOpenCreditLinesAndLoans"].value_counts()
Out[16]:
2.0    84940
3.0    43010
1.0    22050
Name: catNumberOfOpenCreditLinesAndLoans, dtype: int64
In [17]:
data.loc[data["MonthlyIncome"] <=80000, "catMonthlyIncome"] = 2
data.loc[data["MonthlyIncome"] <=30000, "catMonthlyIncome"] = 1
data.loc[data["MonthlyIncome"] >80000, "catMonthlyIncome"] = 3
In [18]:
data["catMonthlyIncome"].value_counts()
Out[18]:
1.0    119477
2.0       676
3.0       116
Name: catMonthlyIncome, dtype: int64
In [19]:
data.loc[data["DebtRatio"] <=.66, "catDebtRatio"] = 2
data.loc[data["DebtRatio"] <=.33, "catDebtRatio"] = 1
data.loc[data["DebtRatio"] >.66, "catDebtRatio"] = 3
In [20]:
data["catDebtRatio"].value_counts()
Out[20]:
1.0    68391
3.0    44404
2.0    37205
Name: catDebtRatio, dtype: int64
In [21]:
data.loc[data["age"] <=50, "catage"] = 2
data.loc[data["age"] <=30, "catage"] = 1
data.loc[data["age"] >50, "catage"] = 3
In [22]:
data["catage"].value_counts()
Out[22]:
3.0    79866
2.0    59376
1.0    10758
Name: catage, dtype: int64
In [23]:
data.loc[data["SeriousDlqin2yrs"] > 0, "catSeriousDlqin2yrs"] = 1
data.loc[data["SeriousDlqin2yrs"] < 1, "catSeriousDlqin2yrs"] = 0
In [24]:
data["catSeriousDlqin2yrs"].value_counts()
Out[24]:
0.0    139974
1.0     10026
Name: catSeriousDlqin2yrs, dtype: int64
In [25]:
data.loc[data["RevolvingUtilizationOfUnsecuredLines"] <=.6, "catRevolvingUtilizationOfUnsecuredLines"] = 2
data.loc[data["RevolvingUtilizationOfUnsecuredLines"] <=.3, "catRevolvingUtilizationOfUnsecuredLines"] = 1
data.loc[data["RevolvingUtilizationOfUnsecuredLines"] >.6, "catRevolvingUtilizationOfUnsecuredLines"] = 3
In [26]:
data["catRevolvingUtilizationOfUnsecuredLines"].value_counts()
Out[26]:
1.0    92882
3.0    35231
2.0    21887
Name: catRevolvingUtilizationOfUnsecuredLines, dtype: int64
In [27]:
df = data[["catMonthlyIncome", "times_delinquent30-59", "times_delinquent60-89", "times_delinquent90", 
           "catNumberOfDependents", "catNumberRealEstateLoansOrLines", "catNumberOfOpenCreditLinesAndLoans",
           "catDebtRatio", "catage", "catSeriousDlqin2yrs", "catRevolvingUtilizationOfUnsecuredLines"]]
In [28]:
df.head()
Out[28]:
catMonthlyIncome times_delinquent30-59 times_delinquent60-89 times_delinquent90 catNumberOfDependents catNumberRealEstateLoansOrLines catNumberOfOpenCreditLinesAndLoans catDebtRatio catage catSeriousDlqin2yrs catRevolvingUtilizationOfUnsecuredLines
0 1.0 1.0 1.0 1.0 1.0 2.0 3.0 3.0 2.0 1.0 3.0
1 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 2.0 0.0 3.0
2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 0.0 3.0
3 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 1.0 0.0 1.0
4 2.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 2.0 0.0 3.0
In [29]:
df.dropna()
Out[29]:
catMonthlyIncome times_delinquent30-59 times_delinquent60-89 times_delinquent90 catNumberOfDependents catNumberRealEstateLoansOrLines catNumberOfOpenCreditLinesAndLoans catDebtRatio catage catSeriousDlqin2yrs catRevolvingUtilizationOfUnsecuredLines
0 1.0 1.0 1.0 1.0 1.0 2.0 3.0 3.0 2.0 1.0 3.0
1 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 2.0 0.0 3.0
2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 0.0 3.0
3 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 1.0 0.0 1.0
4 2.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 2.0 0.0 3.0
... ... ... ... ... ... ... ... ... ... ... ...
149994 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 0.0 2.0
149995 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 3.0 0.0 1.0
149996 1.0 1.0 1.0 1.0 1.0 1.0 2.0 3.0 2.0 0.0 1.0
149998 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 1.0 0.0 1.0
149999 1.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 3.0 0.0 3.0

120269 rows × 11 columns

In [30]:
df = df.dropna()
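`dropna()` shrinks the frame from 150,000 to 120,269 rows here; before dropping, it can be worth checking which columns drive the loss. A toy sketch (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values, standing in for the binned columns above.
toy = pd.DataFrame({
    "catMonthlyIncome": [1.0, np.nan, 2.0, np.nan],
    "catage":           [2.0, 3.0, 1.0, 2.0],
})
print(toy.isna().sum())   # missing count per column
print(len(toy.dropna()))  # rows that survive dropna
```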
In [31]:
X = df[["times_delinquent30-59", "times_delinquent60-89", "times_delinquent90", 
           "catNumberOfDependents", "catNumberRealEstateLoansOrLines", "catNumberOfOpenCreditLinesAndLoans",
           "catDebtRatio", "catage", "catSeriousDlqin2yrs", "catRevolvingUtilizationOfUnsecuredLines"]]
Y = df[["catMonthlyIncome"]]
In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
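Given how imbalanced the binned target is, a stratified split keeps class proportions identical in train and test; `train_test_split` accepts a `stratify` argument for this. A sketch on a small synthetic target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 90 majority, 10 minority.
y = np.array([1] * 90 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 9:1 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
print(np.bincount(y_te))
```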
In [33]:
clf = DecisionTreeClassifier()
In [34]:
clf = clf.fit(X_train, Y_train)
In [35]:
Y_pred = clf.predict(X_test)
In [36]:
print("Accuracy:", metrics.accuracy_score(Y_test, Y_pred))
Accuracy: 0.9932235802777085
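Accuracy alone hides per-class behavior on an imbalanced target; a confusion matrix and classification report give the per-class breakdown. A self-contained sketch with made-up labels standing in for `Y_test` and `Y_pred`:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up 3-class labels standing in for Y_test and Y_pred above.
y_true = [1, 1, 1, 2, 2, 3, 1, 2]
y_hat  = [1, 1, 2, 2, 2, 3, 1, 1]

print(confusion_matrix(y_true, y_hat))       # rows = true class, columns = predicted
print(classification_report(y_true, y_hat))  # per-class precision/recall/F1
```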

Print the decision tree of the data.

In [ ]:
dot_data = StringIO()
In [ ]:
feature_columns = ["times_delinquent30-59", "times_delinquent60-89", "times_delinquent90", 
           "catNumberOfDependents", "catNumberRealEstateLoansOrLines", "catNumberOfOpenCreditLinesAndLoans",
           "catDebtRatio", "catage", "catSeriousDlqin2yrs", "catRevolvingUtilizationOfUnsecuredLines"]
In [ ]:
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True, 
                feature_names=feature_columns, class_names=["1","2","3"])
In [ ]:
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
In [ ]:
graph.write_png("MonthlyIncome.png")
In [ ]:
Image(graph.create_png())
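If graphviz/pydotplus will not install cleanly (as noted in the summary below), `sklearn.tree.plot_tree` (added in scikit-learn 0.21) renders the tree with matplotlib alone, no external graphviz binary needed. A minimal sketch on the iris toy dataset rather than this notebook's fitted tree:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(demo_clf, filled=True, rounded=True, ax=ax)
fig.savefig("tree_demo.png")  # pure-matplotlib PNG, no graphviz required
```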

Naive Bayes Theorem

In [37]:
#Import the Naive Bayes Model
from sklearn.naive_bayes import GaussianNB as gnb
from sklearn.naive_bayes import MultinomialNB as mnb
In [38]:
#Create a classifer to run the model Gaussian NB
modelgnb = gnb()
In [39]:
#Train the model GNB
modelgnb.fit(X_train, Y_train.values.ravel())
Out[39]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [40]:
#Predict the response
y_predgnb = modelgnb.predict(X_test)
In [41]:
#Check the Accuracy
print("Accuracy:", metrics.accuracy_score(Y_test, y_predgnb))
Accuracy: 0.4238796042238297
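GaussianNB assumes each feature is continuous and roughly normal within each class, which fits the integer category codes here poorly and likely explains the low score. For purely categorical features, `CategoricalNB` (added in scikit-learn 0.22, so newer than the version shown in the warnings above) models per-category frequencies directly; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Synthetic categorical features coded 0..2, like the binned columns (shifted to start at 0).
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(200, 4))
y = (X[:, 0] == 2).astype(int)  # label depends on the first feature only

cat_clf = CategoricalNB().fit(X, y)
print(cat_clf.score(X, y))
```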
In [42]:
#Create a classifier to run the model Multinomial NB
modelmnb = mnb()
In [43]:
#Train the model MNB
modelmnb.fit(X_train, Y_train.values.ravel())
Out[43]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [44]:
#Predict the response
y_predmnb = modelmnb.predict(X_test)
In [45]:
#Check the Accuracy
print("Accuracy:", metrics.accuracy_score(Y_test, y_predmnb))
Accuracy: 0.9933482996591003
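Both 99.3% scores are worth checking against the majority-class baseline: per the value_counts above, catMonthlyIncome is about 99.3% class 1 (119,477 of 120,269 rows), so a model that always predicts class 1 already scores roughly 99.3%. A sketch with `DummyClassifier` on a synthetic target of similar imbalance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic target with roughly the same imbalance as catMonthlyIncome.
y = np.array([1] * 993 + [2] * 5 + [3] * 2)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))
```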

Summary of the findings:

For the homework, I found a dataset with credit-style features such as the number of times a person was delinquent on their bills, how much a person makes per month, and how many open credit lines and loans a person holds. The dataset is from kaggle.com (https://www.kaggle.com/c/GiveMeSomeCredit/overview). For preprocessing, I binned/categorized the variables so there would not be so many distinct values. I decided to focus on the monthly income a person makes. I built a decision tree on the features of the dataset and, using the training data (with 20% held out for testing), predicted the monthly income bin with 99.3% accuracy. After the decision tree, I built a Gaussian Naive Bayes model, which proved relatively worthless on this data and underperformed at only 42.4% accuracy. I then tried a Multinomial Naive Bayes model, which performed very well, testing at 99.3% accuracy on the held-out split. Comparing the models, the decision tree and Multinomial Naive Bayes both performed extremely well on this data, but that could be due to over-aggressive preprocessing or over-categorizing.

MonthlyIncome.png

This is the output .png image. This particular file could not be generated due to an issue between Windows and the graphviz/pydotplus libraries. I was unable to get it to work, but the code (which is labeled) does run when the libraries are installed properly on OS X with Anaconda.

What is the difference between Gaussian Naive Bayes and Multinomial Naive Bayes?

The variants of Naive Bayes differ in which probability distribution they assume for the features: Gaussian Naive Bayes models each feature as normally distributed within a class, while Multinomial Naive Bayes models discrete counts or frequencies.

The chosen distribution is used for the likelihood term of the Bayesian model, i.e. how the feature values are expected to be distributed within each class.

The posterior is P(class|data) = P(data|class) * P(class) / P(data). The distribution we choose models the likelihood P(data|class), and the classifier predicts the class with the highest posterior given the data.
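As a numeric illustration of the formula above, a short sketch with made-up probabilities for a two-class problem:

```python
# Bayes' rule with made-up numbers: P(c|x) = P(x|c) * P(c) / P(x),
# where the evidence P(x) sums P(x|c) * P(c) over all classes.
likelihood = {"low": 0.7, "high": 0.1}   # P(x|class)
prior      = {"low": 0.6, "high": 0.4}   # P(class)

evidence = sum(likelihood[c] * prior[c] for c in prior)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

The posteriors always sum to 1, since the evidence term normalizes the class scores.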